Current Issue: April-June | Volume: 2024 | Issue Number: 2 | Articles: 5
The automatic identification of emotions from speech is significant for facilitating interactions between humans and machines. To improve speech emotion recognition accuracy, we extract mel-frequency cepstral coefficients (MFCCs) and pitch features from raw signals, and an improved differential evolution (DE) algorithm is used for feature selection with K-nearest neighbor (KNN) and random forest (RF) classifiers. The proposed multivariate DE (MDE) adopts three mutation strategies to address the slow convergence of classical DE and maintain population diversity, and employs a jumping method to avoid falling into local optima. Simulations are conducted on four public English speech emotion datasets covering a diverse range of emotions: eNTERFACE05, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Surrey Audio-Visual Expressed Emotion (SAVEE) database, and the Toronto Emotional Speech Set (TESS). The MDE algorithm is compared with PSO-assisted biogeography-based optimization (BBO_PSO), DE, and the sine cosine algorithm (SCA) on emotion recognition error, number of selected features, and running time. MDE achieves errors of 0.5270, 0.5044, 0.4490, and 0.0420 on eNTERFACE05, RAVDESS, SAVEE, and TESS with the KNN classifier, and errors of 0.4721, 0.4264, 0.3283, and 0.0114 with the RF classifier. The proposed algorithm demonstrates excellent emotion recognition accuracy and identifies meaningful acoustic features among the MFCC and pitch features.
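As a minimal illustration of the feature-selection machinery this abstract describes, the sketch below shows the classic DE/rand/1 mutation operator and the thresholding step that turns a real-valued DE vector into a feature mask. The three MDE mutation strategies, the jumping method, and the KNN/RF fitness evaluation are not reproduced here; all values are placeholders.

```python
import random

def de_rand_1_mutation(population, i, F=0.5):
    """Classic DE/rand/1 mutation: v = x_r1 + F * (x_r2 - x_r3),
    with r1, r2, r3 distinct and different from the target index i."""
    idxs = [j for j in range(len(population)) if j != i]
    r1, r2, r3 = random.sample(idxs, 3)
    x1, x2, x3 = population[r1], population[r2], population[r3]
    return [a + F * (b - c) for a, b, c in zip(x1, x2, x3)]

def to_feature_mask(vector, threshold=0.5):
    """Binarize a real-valued DE vector: feature d is selected
    iff its component exceeds the threshold."""
    return [1 if v > threshold else 0 for v in vector]

random.seed(0)  # deterministic demo
# 5 candidate solutions over 6 acoustic features (placeholder values)
pop = [[random.random() for _ in range(6)] for _ in range(5)]
mutant = de_rand_1_mutation(pop, 0)
mask = to_feature_mask(mutant)  # which of the 6 features to keep
```

In a full feature-selection loop, each mask would be scored by classifier error (here, KNN or RF) plus a penalty on the number of selected features, and the best vectors would survive to the next generation.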
Stress can be considered a mental and physiological reaction to high-discomfort and challenging situations. Stress levels are reflected in both a person's physiological responses and speech signals, so the fusion of these two modalities is of great interest, and public datasets are necessary so that different proposed solutions can be compared. In this work, a publicly available multimodal dataset for stress detection is introduced, including physiological signals and speech cues. The physiological signals come from electrocardiogram (ECG), respiration (RSP), and inertial measurement unit (IMU) sensors integrated into a smart vest. A data collection protocol was designed to record physiological and audio data across alternations between well-known stressors and relaxation periods. Five subjects participated in the data collection; both their physiological and audio signals were recorded using the developed smart vest and an audio recording application. In addition, an analysis of the data and a decision-level fusion scheme are proposed. The analysis of the physiological signals includes extensive feature extraction along with various fusion and feature selection methods, while the audio analysis feeds state-of-the-art features to a classifier to predict stress levels. Results from the audio and physiological analyses are fused at the decision level by a machine learning algorithm for the final stress level detection. The whole framework was also tested in a real-life disaster-management pilot scenario, where users acted as first responders while their stress was monitored in real time.
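The decision-level fusion step can be sketched as follows. The paper trains a machine-learning model on the per-modality decisions; this minimal example instead uses an illustrative weighted average of per-class probabilities, with made-up weights and values.

```python
def decision_level_fusion(p_physio, p_audio, w_physio=0.6, w_audio=0.4):
    """Fuse per-class stress probabilities from two modalities by a
    weighted average and return the index of the winning class.
    The weights here are illustrative; the paper learns the fusion
    with a machine learning algorithm instead."""
    fused = [w_physio * a + w_audio * b for a, b in zip(p_physio, p_audio)]
    return max(range(len(fused)), key=fused.__getitem__)

# e.g. three stress levels: 0 = low, 1 = medium, 2 = high
label = decision_level_fusion([0.2, 0.5, 0.3], [0.1, 0.2, 0.7])  # -> 2
```

The appeal of decision-level (late) fusion is that each modality keeps its own specialized pipeline, and a missing modality degrades gracefully rather than breaking a joint feature vector.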
Speech synthesis is a technology that converts text into speech waveforms. With the development of deep learning, neural network-based speech synthesis has been researched in various fields, and the quality of synthesized speech has improved significantly. In particular, Grad-TTS, a speech synthesis model based on the denoising diffusion probabilistic model (DDPM), exhibits high performance across domains, generates high-quality speech, and supports multi-speaker synthesis; however, it cannot synthesize speech for unseen speakers. Therefore, this study proposes an effective zero-shot multi-speaker speech synthesis model that improves upon the Grad-TTS structure. The proposed method obtains speaker information from reference speech using a pre-trained speaker recognition model. In addition, by transforming speaker information via information perturbation, the model can learn speaker variation beyond that present in the dataset. To evaluate the proposed method, we measured objective performance indicators, namely speaker encoder cosine similarity (SECS) and mean opinion score (MOS). Grad-TTS, SC-GlowTTS, and YourTTS were compared for both seen-speaker and unseen-speaker scenarios. The results demonstrate excellent synthesis performance for seen speakers and, for unseen speakers, performance similar to that of existing zero-shot multi-speaker speech synthesis models.
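The SECS metric mentioned above is the cosine similarity between the speaker embeddings of a reference utterance and a synthesized one; values near 1 mean the synthesized voice closely matches the reference speaker. A minimal sketch, with short placeholder vectors standing in for the output of a pre-trained speaker encoder:

```python
import math

def secs(emb_ref, emb_syn):
    """Speaker encoder cosine similarity (SECS): cosine of the angle
    between the reference and synthesized speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_ref, emb_syn))
    norm = (math.sqrt(sum(a * a for a in emb_ref))
            * math.sqrt(sum(b * b for b in emb_syn)))
    return dot / norm

# placeholder embeddings; real ones are high-dimensional encoder outputs
sim = secs([0.3, 0.8, 0.5], [0.25, 0.82, 0.47])  # close vectors -> sim near 1
```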
Dimensional emotion can describe rich, fine-grained emotional states better than categorical emotion. In human-robot interaction, the ability to continuously recognize dimensional emotions from speech empowers robots to capture the temporal dynamics of a speaker's emotional state and adjust their interaction strategies in real time. In this study, we present an approach that enhances dimensional emotion recognition through a modulation-filtered cochleagram and a parallel attention recurrent neural network (PA-net). First, the multi-resolution modulation-filtered cochleagram is derived from speech signals through auditory signal processing. Subsequently, the PA-net establishes multi-temporal dependencies across feature scales, enabling the tracking of dynamic variations in dimensional emotion within auditory modulation sequences. Experiments on the RECOLA dataset demonstrate that, at the feature level, the modulation-filtered cochleagram surpasses the other assessed features in predicting valence and arousal, with an especially pronounced advantage at high signal-to-noise ratios. At the model level, the PA-net attains the highest predictive performance for both valence and arousal, clearly outperforming alternative regression models. Furthermore, experiments on the SEWA dataset demonstrate the substantial improvements the proposed method brings to valence and arousal prediction. These results collectively highlight the effectiveness of our approach in advancing dimensional speech emotion recognition.
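The attention mechanism at the heart of a model like PA-net can be illustrated by attention-weighted pooling over a feature sequence: each time step gets a softmax-normalized relevance weight, and the pooled vector is the weighted sum. PA-net applies this idea in parallel across temporal scales with learned scores; the scores and two-dimensional feature vectors below are fixed placeholders.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(features, scores):
    """Attention-weighted pooling: weight each time step's feature
    vector by its softmax-normalized score and sum over time."""
    w = softmax(scores)
    dim = len(features[0])
    return [sum(w[t] * features[t][d] for t in range(len(features)))
            for d in range(dim)]

# 3 time steps of 2-d features; the last step gets a much higher score,
# so the pooled vector is dominated by it
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                        [0.1, 0.2, 2.0])
```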
This paper uses the ARIMA time-series model and the BLSTM sentiment classification algorithm to predict the inheritable elements and direction of traditional music heritage. Differencing the non-stationary series stabilizes the time-series data on traditional music inheritance. By extracting all the features contained in the sheet music, the interpreted scores are subjected to sentiment analysis to further identify the inheritable elements of traditional music. The results show that the time-series model predicts the inheritable elements of traditional music with high accuracy (the MAPE of the ARIMA model is 5.9658308), and that the melody and structure of traditional music can, to a certain extent, be integrated into piano teaching in colleges and universities, with integration degrees of 0.72 for melody and 0.655 for structure.
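Two pieces of the pipeline above are easy to make concrete: the differencing that stabilizes a non-stationary series (the "I" in ARIMA) and the MAPE metric quoted for the model fit. The series below is invented for illustration; it is not the paper's data.

```python
def difference(series, d=1):
    """Apply d-th order differencing, the standard way to stabilize
    a non-stationary series before fitting the AR and MA terms."""
    for _ in range(d):
        series = [series[i] - series[i - 1] for i in range(1, len(series))]
    return series

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a)
                           for a, p in zip(actual, predicted))

diffed = difference([3.0, 5.0, 8.0, 12.0])       # -> [2.0, 3.0, 4.0]
err = mape([100.0, 200.0, 400.0],
           [95.0, 210.0, 380.0])                  # -> 5.0 (percent)
```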